(57) ABSTRACT

A group and virtual locking mechanism (GVLM) addresses
two classes of synchronization present in a system having
resources that are shared by a plurality of processors: (1)
synchronization of the multi-access shared resources; and
(2) simultaneous requests for the shared resources. The
system is a programmable processing engine comprising an
array of processor complex elements, each having a
microcontroller processor. The processor complexes are
preferably arrayed as rows and columns. Broadly stated, the
novel GVLM comprises a lock controller function associated with
each column of processor complexes and lock instructions
executed by the processors that manipulate the lock
controller to create a tightly integrated arrangement for issuing
lock requests to the shared resources.
TITLE: Power- and speed-efficient data storage/transfer architecture models and
design methodologies for programmable or reusable multi-media processors

Abstract Text (1):

A programmable processing engine and a method of operating the same are described,
the processing engine including a customized processor, a flexible processor and a 
data store commonly sharable between the two processors. The customized processor 
normally executes a sequence of a plurality of pre-customized routines, usually for 
which it has been optimized. To provide some flexibility for design changes and 
optimizations, a controller for monitoring the customized processor during 
execution of routines is provided to select one of a set of pre-customized 
processing interruption points and for switching context from the customized 
processor to the flexible processor at the interruption point. The customized 
processor can then be switched off and the flexible processor carries out a 
modified routine. By using a sharable data store, the context switch can be chosen
at a time when all relevant data is in the sharable data store. This means that the 
flexible processor can pick up the modified processing cleanly. After the modified 
processing the flexible processor writes back new data into the data store and the 
customized processor can continue processing either where it left off or may skip a 
certain number of cycles as instructed by the flexible processor, before beginning 
processing of the new data. 
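
The switching behaviour described in this abstract can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the patented implementation: the name `run_engine`, the dictionary standing in for the sharable data store, and the choice to express the skip in whole routines rather than cycles are all hypothetical.

```python
def run_engine(routines, interruption_points, selected, modified, skip=0):
    """Toy model of one context switch between two processors.

    routines: functions (store -> None) run in order by the customized processor.
    interruption_points: routine indices at which a switch is permitted.
    selected: the interruption point chosen by the controller, or None.
    modified: function (store -> None) run by the flexible processor.
    skip: number of customized routines skipped after the switch
          (assumption: the source speaks of skipping cycles, not routines).
    """
    if selected is not None and selected not in interruption_points:
        raise ValueError("switch is only allowed at a pre-customized point")
    store, trace = {}, []  # the dict stands in for the sharable data store
    i = 0
    while i < len(routines):
        if i == selected:
            # Context switch: the flexible processor works on the shared data.
            modified(store)
            trace.append(("flexible", i))
            i += skip          # customized processor resumes here
            selected = None    # switch only once
            continue
        routines[i](store)
        trace.append(("custom", i))
        i += 1
    return store, trace


def load(store):
    store["x"] = 3

def scale(store):
    store["x"] = store["x"] * 2

def offset(store):
    store["x"] = store["x"] + 1

def scale_by_ten(store):  # hypothetical modified routine
    store["x"] = store["x"] * 10


normal_store, _ = run_engine([load, scale, offset], {1}, None, scale_by_ten)
switched_store, switched_trace = run_engine(
    [load, scale, offset], {1}, 1, scale_by_ten, skip=1)
# normal_store == {"x": 7}; switched_store == {"x": 31}
```

In the switched run the flexible processor replaces `scale` with `scale_by_ten`, and the customized processor then resumes at `offset` using the data written back to the store.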

Brief Summary Text (15):

Many such architectures have been proposed for video and image processing. Power
management and power reduction for these processors are hardly tackled in the
literature, but they are recognized as a growing problem in the industry (at least
on the "customer" side). Several recent commercial multi-media oriented processors
have been marketed or announced: TI-C80 and recently C60, Philips-TriMedia,
Chromatic-Mpact, Nvidia NV1, NEC PIP-RAM. Several other superscalar/VLIW processors
have been announced with an extended instruction set for multi-media applications:
Intel (MMX), SGI/MIPS (MDMX), HP (MAX), DEC (MVI), Sun (VIS), AMD (MMX), IBM (Java).
Also, a few more dedicated domain-specific ASIP processors have been proposed, such
as the MIPS MPEG2 engine, which includes a multi-RISC, several memories and a
programmable network.

Brief Summary Text (33):

The present invention includes a programmable processing engine, the processing 
engine including a customized processor, a flexible processor and a data store 
commonly sharable between the two processors, the customized processor normally 
executing a sequence of a plurality of pre-customized routines, comprising: a 
controller for monitoring the customized processor during execution of a first code 
portion to flexibly select one of a set of pre-customized processing interruption 
points in a first routine and for switching context from the customized processor 
to the flexible processor at the selected interruption point. 

Brief Summary Text (34): 

The present invention also includes a method of operating a programmable processing 
engine, the processing engine including a customized processor, a flexible 
processor and a data store commonly sharable between the two processors, the
customized processor normally executing a sequence of a plurality of pre-customized 
routines, comprising the steps of: monitoring the customized processor during 
execution of a first code portion to flexibly select one of a set of pre-customized
processing interruption points in a first routine; and switching context from the 
customized processor to the flexible processor at the selected interruption point. 
The method preferably includes as a next step executing a second code portion on 
said flexible processor using at least a part of first data left in the data store 
by the execution of the first code portion on the customized processor. 

Detailed Description Text (63):

All communication between the system processor 32 and the cACU 31 is therefore
mainly devoted to resolving only run-time data-dependencies and to explicitly
synchronizing the evolution of both threads of control (the IP 32 and cACU 31). For
the master-master model, this explicit synchronization is limited to the specific
points where the original functionality has to be modified (namely at a context
switch), as described above with respect to the second embodiment of the present
invention, i.e. a context switch takes place whenever a new functionality is added
or modified with respect to the initial version of the application, which means
that some parts of the original functionality assigned to the cACU 31 are taken
over (in a modified form) by the IP 32.

Detailed Description Text (64): 

Sometimes, the IP 32 needs to follow the traversal of the cACU 31, specifically
when data-dependent conditional branches are being decided locally in the cACU 31
and the context switch needs to happen inside one or several of them. Normally, the
traversed paths result in unbalanced conditional threads. Thus, it becomes very
difficult to predict when the context switch should happen. To concurrently match
the unbalanced evolution, several design options are possible, all of which are
aspects of this embodiment of the present invention:

Detailed Description Text (67):

This architecture allows a synchronous co-operative model to be defined, mainly
constrained by the order of the memory accesses that are subject to custom
addressing. Otherwise, both performance and cost (area/power) efficiencies can be
very limited. Also, the use of related clocks derived from the same system clock 35
is mainly needed for efficiency in the high-level synchronization between both
threads of control (the IP 32 and the cACU 31). In the master-master architectures,
it affects the synchronization of both schedules (IP 32 and cACU 31). In the
master-slave case, it affects the pipelining of the operations of both blocks. If
the related clocks suffer from clock-skew problems, it is possible to use a
low-level asynchronous protocol for the communication of the data-dependencies,
independently of the model chosen.

Detailed Description Text (68):

A first model allows a master-master concurrent evolution of the two threads of
control (one for the cACU 31 and one for the IP 32). This model is a specific case of
the custom block described in a more generic way above in the first embodiment of
the present invention and refined for a cACU 31 as an individual embodiment of the
present invention. The synchronization in this case is implicit: the elapsed time
between the memory transfers subject to custom addressing is estimated or
controlled at compile time. Typically, for data-transfer intensive
applications, the memory accesses to the slower memories dominate the overall
system timing. Thus, one possibility is to concentrate on the memory transfers and
assume that the compiler will schedule the transfers such that they are kept in the
same order as in the original algorithm. Therefore, by controlling the compiler (as
is the case for some ASIPs) it is possible to predict when each memory transfer
takes place. Another possibility for gathering this timing information is
analyzing the scheduling output of the compiler. Note that there is no need to
have all details of the scheduling, just those related to the memory access
operations which are subject to custom addressing. In both cases, no control-flow
decisions shall be left to run time (e.g., hardware cache, interrupts, etc.) to
prevent modifications of the schedule after compilation.
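
The compile-time timing estimate described above can be sketched as follows. This is an illustration only, assuming a fully static schedule with fixed per-operation latencies; the function names, the tuple layout and the example latencies are hypothetical and do not come from the source.

```python
def transfer_times(schedule):
    """Return (name, start_cycle) for each custom-addressed memory transfer.

    schedule: ordered list of (name, latency_cycles, is_custom_addressed).
    Assumes fixed latencies and no run-time control flow, so the start cycle
    of every operation is known once the schedule is fixed.
    """
    cycle = 0
    times = []
    for name, latency, is_custom in schedule:
        if is_custom:
            times.append((name, cycle))
        cycle += latency
    return times


def elapsed_between(times):
    """Cycles elapsed between consecutive custom-addressed transfers."""
    return [later - earlier
            for (_, earlier), (_, later) in zip(times, times[1:])]


# Hypothetical schedule; operation names and latencies are made up.
example_schedule = [
    ("load A[i]", 4, True),
    ("multiply", 1, False),
    ("add", 1, False),
    ("store B[i]", 4, True),
    ("load A[i+1]", 4, True),
]
example_times = transfer_times(example_schedule)
# example_times == [("load A[i]", 0), ("store B[i]", 6), ("load A[i+1]", 10)]
# elapsed_between(example_times) == [6, 4]
```

Only the entries flagged as custom-addressed are reported, matching the remark that the remaining scheduling details are not needed.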

Detailed Description Text (96):

Three instructions are needed to perform the complete normal and modified operating
modes, plus the already existing load/store operation. Two of them, data-ready and
skip-acu (#states), are common to both embodiments: master-master and master-slave.
However, the third instruction differs slightly in semantics. In both models the
third instruction "means" that an action in the cACU 31 should be started. In the
master-master model the "start" action triggers or continues the control thread of
the cACU 31 (start-acu). In the master-slave model, it just triggers one
super-state transition of the corresponding control thread.
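
The difference in "start" semantics between the two models can be illustrated with a toy state machine. This is a sketch under strong assumptions: the class `ToyCACU`, its integer states, and the choice to let the master-master thread run to its last state are hypothetical, and the data-ready instruction is not modelled because its semantics are not given in this passage.

```python
class ToyCACU:
    """Toy stand-in for the cACU control thread (hypothetical model)."""

    def __init__(self, n_states, master_slave):
        self.n_states = n_states          # number of (super-)states in the thread
        self.master_slave = master_slave  # True: master-slave, False: master-master
        self.state = 0

    def skip_acu(self, n_states):
        """Model of skip-acu (#states): advance without performing the states."""
        self.state = min(self.state + n_states, self.n_states)

    def start(self):
        """Model of the third instruction; returns the states performed.

        Master-slave: exactly one super-state transition.
        Master-master: the thread continues (assumption: until its last state).
        """
        steps = 1 if self.master_slave else self.n_states - self.state
        end = min(self.state + steps, self.n_states)
        performed = list(range(self.state, end))
        self.state = end
        return performed


slave = ToyCACU(n_states=4, master_slave=True)
first = slave.start()      # [0]
slave.skip_acu(2)          # states 1 and 2 are skipped
second = slave.start()     # [3]

master = ToyCACU(n_states=4, master_slave=False)
master.skip_acu(1)         # state 0 is skipped
rest = master.start()      # [1, 2, 3]
```

In the master-slave case the IP has to issue one start per transition, whereas in the master-master case a single start lets the cACU thread proceed on its own.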

Detailed Description Text (124):

The implementation of the third embodiment comes partly at the price of reduced
flexibility compared to general-purpose RISCs, but especially needs a heavy
investment in new processor architecture design and compiler technology. The
potential savings will be even larger than in the previously described embodiments,
however. Recently, some companies have been proposing very domain-specific
processors for a limited class of applications. An example is the programmable MIPS
MPEG2 engine. These processors still have a limited scope, and the data transfer and
storage bottleneck has not really been solved, especially in terms of power. The
performance of the architecture model of the third embodiment can be summarized as:



Current US Original Classification (1):
712/34 

Current US Cross Reference Classification (1):
712/40 

CLAIMS: 

1. A programmable processing engine, the processing engine including a customized 
processor, a flexible processor and a data store commonly sharable between the two 
processors, the customized processor normally executing a sequence of a plurality
of pre-customized routines, the programmable processing engine comprising:

a controller for monitoring the customized processor during execution of a first 
code portion to select one of a set of pre-customized processing interruption 
points in a first routine and for switching context from the customized processor 
to the flexible processor at the interruption point. 

3. The processing engine of claim 1, wherein the data store is data storage shared 
commonly by both the customized and the programmable processor. 

5. The processing engine of claim 1, wherein the programmable processor is an 
application specific instruction set processor. 

7. The processing engine of claim 1, wherein the programmable processor includes a 
counter means for determining the timing of the context switch. 

8. The processing engine of claim 1, wherein the customized processor is adapted
to supply information to the programmable processor sufficient to determine the timing
of the context switch. 

9. The processing engine of claim 8, wherein the programmable processor is adapted 
to monitor the branch evolution in the customized processor. 
10. The processing engine of claim 8, wherein the programmable processor has a
register and the customized processor is adapted to transmit information relating
to the status of routines running on the custom processor for storage in said
registers.

19. A method of operating a programmable processing engine, the processing engine 
including a customized processor, a flexible processor and a data store commonly 
sharable between the two processors, the customized processor normally executing a
sequence of a plurality of pre-customized routines, the method comprising:



monitoring the customized processor during execution of a first code portion to 
select one of a set of pre-customized processing interruption points in the first 
routine; and 

switching context from the customized processor to the flexible processor at the 
interruption point. 
(57) ABSTRACT

A programmable processing engine and a method of operating
the same are described, the processing engine including
a customized processor, a flexible processor and a data store
commonly sharable between the two processors. The
customized processor normally executes a sequence of a
plurality of pre-customized routines, usually for which it has
been optimized. To provide some flexibility for design
changes and optimizations, a controller for monitoring the
customized processor during execution of routines is
provided to select one of a set of pre-customized processing
interruption points and for switching context from the
customized processor to the flexible processor at the
interruption point. The customized processor can then be switched
off and the flexible processor carries out a modified routine.
By using a sharable data store, the context switch can be
chosen at a time when all relevant data is in the sharable data
store. This means that the flexible processor can pick up the
modified processing cleanly. After the modified processing
the flexible processor writes back new data into the data
store and the customized processor can continue processing
either where it left off or may skip a certain number of cycles
as instructed by the flexible processor, before beginning
processing of the new data.

30 Claims, 23 Drawing Sheets 